The table shows the most frequent special characters with their absolute frequencies. Some symbols give information about typography (“What kind of quotation marks are used most frequently?”), while currency symbols give an indication of the proportion of financial texts.
Quality: Are there unexpected symbols with high frequency? If so, there may be problems with character conversion, non-textual parts in the corpus, etc.
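A minimal sketch of how such a quality check could be automated. The notion of "unexpected" used here (the Unicode replacement character U+FFFD plus control, format, private-use and unassigned characters) is an assumption made for illustration, as are the function name `suspicious_symbols`, the threshold `min_freq` and the sample rows.

```python
import unicodedata

# Assumption: these Unicode general categories rarely belong in running text.
SUSPICIOUS_CATEGORIES = {"Cc", "Cf", "Co", "Cn"}

def suspicious_symbols(rows, min_freq=10):
    """Return the (word, freq) pairs that contain a suspicious character.

    rows is an iterable of (word, freq) pairs, e.g. the result of the
    special-character query; min_freq is a hypothetical cut-off.
    """
    flagged = []
    for word, freq in rows:
        if freq < min_freq:
            continue
        if any(ch == "\ufffd" or unicodedata.category(ch) in SUSPICIOUS_CATEGORIES
               for ch in word):
            flagged.append((word, freq))
    return flagged

# Invented sample data, not taken from a real corpus.
rows = [(".", 2100), ("\ufffd", 340), ("\x92", 120), ("€", 45), ("\x07", 3)]
print(suspicious_symbols(rows))
```

With this sample data, the replacement character and the C1 control character `\x92` (a typical leftover of a wrong Windows-1252 conversion) are flagged, while the rare `\x07` falls below the threshold.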
If the full stop has the highest frequency in the table, i.e. its count exceeds the number of sentences, then some sentences contain internal full stops. While this can occasionally happen in sentences containing direct speech, the main cause is abbreviations unknown to the tokenizer: in this case, the tokenizer separates the full stop from the rest of the abbreviation. This may cause difficulties for a POS tagger applied later.
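The comparison between full stops and sentences can be sketched as follows. The table layout `words(w_id, word, freq)` follows the query used for this table; the helper name `full_stop_excess`, the in-memory database and the sample frequencies are assumptions made for illustration.

```python
import sqlite3

def full_stop_excess(conn):
    """Return (full_stops, sentences, excess) from the words table."""
    def freq_of(symbol):
        row = conn.execute(
            "select freq from words where word = ? and w_id < 101", (symbol,)
        ).fetchone()
        return row[0] if row else 0

    full_stops = freq_of(".")
    sentences = freq_of("%$%")  # end-of-sentence marker = number of sentences
    return full_stops, sentences, full_stops - sentences

# Invented sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("create table words (w_id integer primary key, word text, freq integer)")
conn.executemany(
    "insert into words values (?, ?, ?)",
    [(1, "%^%", 1000), (2, "%$%", 1000), (3, ".", 1080), (4, ",", 950)],
)
print(full_stop_excess(conn))
```

A positive excess is only a rough indicator: sentences ending in a question mark or exclamation mark lower the full-stop count, so the true number of sentence-internal full stops may be higher than the excess suggests.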
In a standard database, the word IDs 1 ... 100 of the word list are reserved for special characters. Here we find punctuation marks, parentheses, mathematical operators etc. The following additional special characters are included:
%^%: A special symbol for “beginning of sentence”. Application: One can find typical sentence beginnings as right-neighbor co-occurrences of this symbol.
%$%: A special symbol for “End of sentence”.
Of course, the frequencies of the symbols %^% and %$% should be equal and give the number of sentences in the corpus.
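This consistency condition can be checked directly against the word list. The following is a sketch under the assumption that both markers are stored in the `words` table used by the query for this table; the function name `sentence_markers_consistent` and the sample data are hypothetical.

```python
import sqlite3

def sentence_markers_consistent(conn):
    """Return (ok, begin_freq, end_freq) for the markers %^% and %$%."""
    rows = dict(
        conn.execute(
            "select word, freq from words where word in (?, ?)", ("%^%", "%$%")
        ).fetchall()
    )
    begin = rows.get("%^%")
    end = rows.get("%$%")
    # Consistent only if both markers exist and their frequencies agree.
    return (begin is not None and begin == end), begin, end

# Invented sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("create table words (w_id integer primary key, word text, freq integer)")
conn.executemany(
    "insert into words values (?, ?, ?)",
    [(1, "%^%", 1000), (2, "%$%", 1000), (3, ".", 1080)],
)
print(sentence_markers_consistent(conn))
```

If the check fails, the difference between the two frequencies gives a first hint at how many sentences were affected during segmentation or import.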
select word, freq from words where 101>w_id and freq>9 order by freq desc;
Should we add a relative frequency to the table? While a relative frequency “per million running words” is the typical one, “per 1000 sentences” might be more convenient here.
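To illustrate the second option, the sketch below computes a frequency “per 1000 sentences” as freq * 1000 / number of sentences, taking the frequency of %$% as the number of sentences. The function name `per_thousand_sentences`, the exclusion of the two sentence markers from the output and the sample data are assumptions made for illustration.

```python
import sqlite3

def per_thousand_sentences(conn):
    """Return (word, freq, freq per 1000 sentences) for special characters."""
    sentences = conn.execute(
        "select freq from words where word = ?", ("%$%",)
    ).fetchone()[0]
    rows = conn.execute(
        "select word, freq from words "
        "where 101>w_id and freq>9 and word not in (?, ?) "
        "order by freq desc",
        ("%^%", "%$%"),
    ).fetchall()
    return [(word, freq, round(freq * 1000 / sentences, 1)) for word, freq in rows]

# Invented sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("create table words (w_id integer primary key, word text, freq integer)")
conn.executemany(
    "insert into words values (?, ?, ?)",
    [(1, "%^%", 2000), (2, "%$%", 2000), (3, ".", 2100),
     (4, ",", 1500), (5, "€", 8), (150, "the", 5000)],
)
for row in per_thousand_sentences(conn):
    print(row)
```

A value above 1000 for the full stop then shows at a glance that there are more full stops than sentences, which is harder to read off a “per million running words” figure.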
Frequencies for letters
Frequencies for digits and numbers
Alphabet in the corpus